Bridging Dense and Sparse Maximum Inner Product Search
Maximum inner product search (MIPS) over dense and sparse vectors has
progressed independently in a bifurcated literature for decades; the latter is
better known as top-k retrieval in Information Retrieval. This duality exists
because sparse and dense vectors serve different end goals. That is despite the
fact that they are manifestations of the same mathematical problem. In this
work, we ask if algorithms for dense vectors could be applied effectively to
sparse vectors, particularly those that violate the assumptions underlying
top-k retrieval methods. We study IVF-based retrieval where vectors are
partitioned into clusters and only a fraction of clusters are searched during
retrieval. We conduct a comprehensive analysis of dimensionality reduction for
sparse vectors, and examine standard and spherical KMeans for partitioning. Our
experiments demonstrate that IVF serves as an efficient solution for sparse
MIPS. As byproducts, we identify two research opportunities and demonstrate
their potential. First, we cast the IVF paradigm as a dynamic pruning technique
and turn that insight into a novel organization of the inverted index for
approximate MIPS for general sparse vectors. Second, we offer a unified regime
for MIPS over vectors that have dense and sparse subspaces, and show its
robustness to query distributions.
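The IVF idea described in this abstract (partition the vectors into clusters, then search only a fraction of the clusters per query) can be illustrated with a minimal sketch. It is not the authors' implementation: the names `dot`, `normalize`, `assign`, `spherical_kmeans`, `build_ivf`, and `search` are hypothetical, sparse vectors are assumed to be stored as `{dimension: value}` dicts, spherical KMeans is used only because it is short to write, and the dimensionality reduction step studied in the paper is left out.

```python
import math
import random

def dot(u, v):
    """Inner product of two sparse vectors stored as {dimension: value} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(x * v.get(d, 0.0) for d, x in u.items())

def normalize(v):
    """Scale v to unit L2 norm; zero vectors are returned unchanged."""
    norm = math.sqrt(dot(v, v))
    return {d: x / norm for d, x in v.items()} if norm > 0 else dict(v)

def assign(vectors, centroids):
    """Label each vector with the centroid that has the largest inner product."""
    return [max(range(len(centroids)), key=lambda c: dot(v, centroids[c]))
            for v in vectors]

def spherical_kmeans(vectors, k, iters=10, seed=0):
    """Illustrative spherical KMeans: centroids are kept at unit norm."""
    rng = random.Random(seed)
    centroids = [normalize(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        labels = assign(vectors, centroids)
        for c in range(k):
            total = {}
            for v, label in zip(vectors, labels):
                if label == c:
                    for d, x in v.items():
                        total[d] = total.get(d, 0.0) + x
            if total:  # an empty cluster keeps its previous centroid
                centroids[c] = normalize(total)
    return centroids

def build_ivf(vectors, k, iters=10, seed=0):
    """Partition vector ids into k inverted lists, one per cluster."""
    centroids = spherical_kmeans(vectors, k, iters, seed)
    lists = [[] for _ in range(k)]
    for i, label in enumerate(assign(vectors, centroids)):
        lists[label].append(i)
    return centroids, lists

def search(query, vectors, centroids, lists, nprobe, top_k):
    """Approximate MIPS: score only the vectors in the nprobe best clusters."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: dot(query, centroids[c]), reverse=True)[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    candidates.sort(key=lambda i: dot(query, vectors[i]), reverse=True)
    return candidates[:top_k]

if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic data: 500 sparse vectors with at most 5 nonzeros over 100 dimensions.
    data = [{rng.randrange(100): rng.random() for _ in range(5)} for _ in range(500)]
    q = {rng.randrange(100): rng.random() for _ in range(5)}
    cents, inv_lists = build_ivf(data, k=16)
    print(search(q, data, cents, inv_lists, nprobe=4, top_k=5))
```

Setting `nprobe` equal to the number of clusters recovers exhaustive search, so `nprobe` is the lever that trades accuracy for work in this sketch.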
An Approximate Algorithm for Maximum Inner Product Search over Streaming Sparse Vectors
Maximum Inner Product Search or top-k retrieval on sparse vectors is
well-understood in information retrieval, with a number of mature algorithms
that solve it exactly. However, all existing algorithms are tailored to text
and frequency-based similarity measures. To achieve optimal memory footprint
and query latency, they rely on the near stationarity of documents and on laws
governing natural languages. We consider, instead, a setup in which collections
are streaming -- necessitating dynamic indexing -- and where indexing and
retrieval must work with arbitrarily distributed real-valued vectors. As we
show, existing algorithms are no longer competitive in this setup, even against
naive solutions. We investigate this gap and present a novel approximate
solution, called Sinnamon, that can efficiently retrieve the top-k results for
sparse real-valued vectors drawn from arbitrary distributions. Notably,
Sinnamon offers levers to trade off memory consumption, latency, and accuracy,
making the algorithm suitable for constrained applications and systems. We give
theoretical results on the error introduced by the approximate nature of the
algorithm, and present an empirical evaluation of its performance on two
hardware platforms and synthetic and real-valued datasets. We conclude by
laying out concrete directions for future research on this general top-k
retrieval problem over sparse vectors.
TF-Ranking: Scalable TensorFlow Library for Learning-to-Rank
Learning-to-Rank deals with maximizing the utility of a list of examples
presented to the user, with items of higher relevance being prioritized. It has
several practical applications such as large-scale search, recommender systems,
document summarization and question answering. While there is widespread
support for classification and regression based learning, support for
learning-to-rank in deep learning has been limited. We propose TensorFlow
Ranking, the first open source library for solving large-scale ranking problems
in a deep learning framework. It is highly configurable and provides
easy-to-use APIs to support different scoring mechanisms, loss functions and
evaluation metrics in the learning-to-rank setting. Our library is developed on
top of TensorFlow and can thus fully leverage the advantages of this platform.
For example, it is highly scalable, both in training and in inference, and can
be used to learn ranking models over massive amounts of user activity data,
which can include heterogeneous dense and sparse features. We empirically
demonstrate the effectiveness of our library in learning ranking functions for
large-scale search and recommendation applications in Gmail and Google Drive.
We also show that ranking models built using our library scale well for
distributed training, without significant impact on metrics. The proposed
library is available to the open source community, with the hope that it
facilitates further academic research and industrial applications in the field
of learning-to-rank.
Comment: KDD 201
Efficiency and timing performance of the MuPix7 high-voltage monolithic active pixel sensor
The MuPix7 is a prototype high voltage monolithic active pixel sensor with
103 × 80 µm² pixels thinned to 64 µm and incorporating the complete
read-out circuitry including a 1.25 Gbit/s differential data link. Using data
taken at the DESY electron test beam, we demonstrate an efficiency of 99.3% and
a time resolution of 14 ns. The efficiency and time resolution are studied with
sub-pixel resolution and reproduced in simulations.
Comment: 7 pages, 13 figures, submitted to Nucl.Instr.Meth.
Drug-microenvironment perturbations reveal resistance mechanisms and prognostic subgroups in CLL
The tumour microenvironment and genetic alterations collectively influence drug efficacy in cancer, but current evidence is limited and systematic analyses are lacking. Using chronic lymphocytic leukaemia (CLL) as a model disease, we investigated the influence of 17 microenvironmental stimuli on 12 drugs in 192 genetically characterised patient samples. Based on microenvironmental response, we identified four subgroups with distinct clinical outcomes beyond known prognostic markers. Response to multiple microenvironmental stimuli was amplified in trisomy 12 samples. Trisomy 12 was associated with a distinct epigenetic signature. Bromodomain inhibition reversed this epigenetic profile and could be used to target microenvironmental signalling in trisomy 12 CLL. We quantified the impact of microenvironmental stimuli on drug response and their dependence on genetic alterations, identifying interleukin 4 (IL4) and Toll-like receptor (TLR) stimulation as the strongest actuators of drug resistance. IL4 and TLR signalling activity was increased in CLL-infiltrated lymph nodes compared with healthy samples. High IL4 activity correlated with faster disease progression. The publicly available dataset can facilitate the investigation of cell-extrinsic mechanisms of drug resistance and disease progression.